AQI EDA Example

In this exercise adapted from Nelson (2020), we will perform exploratory data analysis (EDA) using air quality index (AQI) data for North Carolina during the Peat Bog Fires of 2008.

Learning Objectives

  • Review EDA Goals
  • Understand AQI and how it is calculated
  • Apply EDA to the AQI data
  • Evaluate how spatial resolution can influence air quality data

EDA

Recall that EDA is the process of exploring your data: examining the composition of a dataset, the distributions of individual variables, and relationships between variables. It can help achieve the following goals:

  • Helps check for errors/issues in data
  • Determine if the question you’re asking can be answered with data
  • Helps to formulate approach for answering question(s)

Case Study - 2008 NC Peat Bog Fires

In 2008, burning peat bogs produced haze and air pollution far in excess of national air quality standards in NC (Rappold et al. 2011; Fig. 1).

Fig. 1 from Rappold et al. (2011)

Fig. 1 from Rappold et al. (2011)

Rappold et al. (2011) further labeled counties in NC as “exposed” or “reference” based on whether or not they were exposed to the plume based on satellite imagery:

Fig. 1 from Rappold et al. (2011)

Fig. 1 from Rappold et al. (2011)

The study by Rappold et al. (2011) concluded that in exposed counties, there were significant increases in cumulative relative risks for asthma, chronic obstructive pulmonary disease, and pneumonia and acute bronchitis. Emergency department visits for cardio-pulmonary symptoms and heart failure were also increased. This was due to even a brief (3-day exposure).

General EDA Question

What were the on-the-ground or “in situ” air quality conditions in NC during the peat fires of 2008?

Specific questions we will explore - Among the counties where air quality data are available:

  • what were the maximum air quality index values observed in “reference counties” during the peat fires of 2008?
  • what were the maximum air quality index values observed in “exposed counties” during the peat fires of 2008?
  • what were the maximum air quality index values observed in all NC counties not considered in the Rappold et al. (2011) analysis during the peat fires of 2008?

What is the Air Quality Index?

The Air Quality Index (AQI) can be 0 - 500. The higher the AQI value, the greater level of air pollution. AQI values for five major air pollutants regulated by the Clean Air Act (ground level ozone, particle pollution, carbon monoxide, sulfur dioxide, nitrogen dioxide) are available. An acceptable AQI is < 100. Various levels of concern match different ranges:

AQI concern ranges

AQI concern ranges

Daily AQI values can be downloaded from the EPA https://www.epa.gov/outdoor-air-quality-data. For more information visit https://www.airnow.gov/

General EDA Approach

We will specifically focus on the AQI for fine particulate matter, measured as particles 2.5 um or smaller in diameter (PM2.5). We’re focusing on this AQI as fine particulate matter is the primary pollutant of concern in wildfire smoke.

While Rappold et al. (2011) focused on exposure to plume based on satellite images (see above figures), we want to explore the actual air quality data as well in these counties.

We’ll begin our EDA with understanding the dataset itself, subsetting to NC counties, and then looking at AQI during the fire time periods (June 10-12, 2008) for counties identified as “exposed” or “reference” by Rappold et al. (2011).

Let’s Begin!

First let’s load some packages we’ll need

rm(list=ls(all=TRUE)) # Clear workspace
require(tidyverse)
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
require(ggthemes)
## Loading required package: ggthemes
## Warning: package 'ggthemes' was built under R version 3.6.3

Next, let’s read in our data and check it out:

aqi <- read.csv("daily_aqi_2008.csv")
str(aqi)
## 'data.frame':    150453 obs. of  32 variables:
##  $ state_code         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ county_code        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ site_num           : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ parameter_code     : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
##  $ poc                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ latitude           : num  30.5 30.5 30.5 30.5 30.5 ...
##  $ longitude          : num  -87.9 -87.9 -87.9 -87.9 -87.9 ...
##  $ datum              : Factor w/ 2 levels "NAD83","WGS84": 1 1 1 1 1 1 1 1 1 1 ...
##  $ parameter_name     : Factor w/ 1 level "PM2.5 - Local Conditions": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sample_duration    : Factor w/ 3 levels "1 HOUR","24-HR BLK AVG",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ pollutant_standard : Factor w/ 1 level "PM25 24-hour 2012": 1 1 1 1 1 1 1 1 1 1 ...
##  $ date_local         : Factor w/ 366 levels "2008-01-01","2008-01-02",..: 4 7 10 13 16 19 22 25 28 31 ...
##  $ year               : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ month              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day                : int  4 7 10 13 16 19 22 25 28 31 ...
##  $ units_of_measure   : Factor w/ 1 level "Micrograms/cubic meter (LC)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ event_type         : Factor w/ 3 levels "Excluded","Included",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ observation_count  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ observation_percent: int  100 100 100 100 100 100 100 100 100 100 ...
##  $ arithmetic_mean    : num  9.1 3.7 3.6 9 7.4 5.9 12.2 6 8.6 5.5 ...
##  $ x1st_max_value     : num  9.1 3.7 3.6 9 7.4 5.9 12.2 6 8.6 5.5 ...
##  $ x1st_max_hour      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ aqi                : int  38 15 15 38 31 25 51 25 36 23 ...
##  $ method_code        : int  118 118 118 118 118 118 118 118 118 118 ...
##  $ method_name        : Factor w/ 13 levels "-","Andersen RAAS2.5-100 PM2.5 SAM w/WINS - GRAVIMETRIC",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ local_site_name    : Factor w/ 878 levels "2ZR SITE MOVED FROM RIO RANCHO CITY HALL TO SENIOR CENTER",..: 253 253 253 253 253 253 253 253 253 253 ...
##  $ address            : Factor w/ 947 levels "1 COURT HOUSE SQUARE",..: 684 684 684 684 684 684 684 684 684 684 ...
##  $ state_name         : Factor w/ 53 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county_name        : Factor w/ 500 levels "Ada","Adair",..: 25 25 25 25 25 25 25 25 25 25 ...
##  $ city_name          : Factor w/ 651 levels "Aberdeen","Adjuntas",..: 188 188 188 188 188 188 188 188 188 188 ...
##  $ cbsa_name          : Factor w/ 370 levels "Aberdeen, SD",..: 83 83 83 83 83 83 83 83 83 83 ...
##  $ date_of_last_change: Factor w/ 160 levels "2015-04-01","2015-04-10",..: 1 1 1 1 1 1 1 1 1 1 ...
colnames(aqi)
##  [1] "state_code"          "county_code"         "site_num"           
##  [4] "parameter_code"      "poc"                 "latitude"           
##  [7] "longitude"           "datum"               "parameter_name"     
## [10] "sample_duration"     "pollutant_standard"  "date_local"         
## [13] "year"                "month"               "day"                
## [16] "units_of_measure"    "event_type"          "observation_count"  
## [19] "observation_percent" "arithmetic_mean"     "x1st_max_value"     
## [22] "x1st_max_hour"       "aqi"                 "method_code"        
## [25] "method_name"         "local_site_name"     "address"            
## [28] "state_name"          "county_name"         "city_name"          
## [31] "cbsa_name"           "date_of_last_change"
dim(aqi)
## [1] 150453     32
# Look at the top and bottom of your data ----
head(aqi)
##   state_code county_code site_num parameter_code poc latitude longitude datum
## 1          1           3       10          88101   1 30.49748 -87.88026 NAD83
## 2          1           3       10          88101   1 30.49748 -87.88026 NAD83
## 3          1           3       10          88101   1 30.49748 -87.88026 NAD83
## 4          1           3       10          88101   1 30.49748 -87.88026 NAD83
## 5          1           3       10          88101   1 30.49748 -87.88026 NAD83
## 6          1           3       10          88101   1 30.49748 -87.88026 NAD83
##             parameter_name sample_duration pollutant_standard date_local year
## 1 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-04 2008
## 2 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-07 2008
## 3 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-10 2008
## 4 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-13 2008
## 5 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-16 2008
## 6 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012 2008-01-19 2008
##   month day            units_of_measure event_type observation_count
## 1     1   4 Micrograms/cubic meter (LC)       None                 1
## 2     1   7 Micrograms/cubic meter (LC)       None                 1
## 3     1  10 Micrograms/cubic meter (LC)       None                 1
## 4     1  13 Micrograms/cubic meter (LC)       None                 1
## 5     1  16 Micrograms/cubic meter (LC)       None                 1
## 6     1  19 Micrograms/cubic meter (LC)       None                 1
##   observation_percent arithmetic_mean x1st_max_value x1st_max_hour aqi
## 1                 100             9.1            9.1             0  38
## 2                 100             3.7            3.7             0  15
## 3                 100             3.6            3.6             0  15
## 4                 100             9.0            9.0             0  38
## 5                 100             7.4            7.4             0  31
## 6                 100             5.9            5.9             0  25
##   method_code                                            method_name
## 1         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
## 2         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
## 3         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
## 4         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
## 5         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
## 6         118 R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC
##     local_site_name                                                  address
## 1 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
## 2 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
## 3 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
## 4 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
## 5 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
## 6 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA
##   state_name county_name city_name                 cbsa_name
## 1    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
## 2    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
## 3    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
## 4    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
## 5    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
## 6    Alabama     Baldwin  Fairhope Daphne-Fairhope-Foley, AL
##   date_of_last_change
## 1          2015-04-01
## 2          2015-04-01
## 3          2015-04-01
## 4          2015-04-01
## 5          2015-04-01
## 6          2015-04-01
tail(aqi)
##        state_code county_code site_num parameter_code poc latitude longitude
## 150448         78          10       12          88101   2 17.71247 -64.78487
## 150449         78          10       12          88101   2 17.71247 -64.78487
## 150450         78          10       12          88101   2 17.71247 -64.78487
## 150451         78          10       12          88101   2 17.71247 -64.78487
## 150452         78          10       12          88101   2 17.71247 -64.78487
## 150453         78          10       12          88101   2 17.71247 -64.78487
##        datum           parameter_name sample_duration pollutant_standard
## 150448 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
## 150449 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
## 150450 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
## 150451 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
## 150452 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
## 150453 WGS84 PM2.5 - Local Conditions         24 HOUR  PM25 24-hour 2012
##        date_local year month day            units_of_measure event_type
## 150448 2008-10-09 2008    10   9 Micrograms/cubic meter (LC)       None
## 150449 2008-10-21 2008    10  21 Micrograms/cubic meter (LC)       None
## 150450 2008-11-02 2008    11   2 Micrograms/cubic meter (LC)       None
## 150451 2008-11-14 2008    11  14 Micrograms/cubic meter (LC)       None
## 150452 2008-11-26 2008    11  26 Micrograms/cubic meter (LC)       None
## 150453 2008-12-08 2008    12   8 Micrograms/cubic meter (LC)       None
##        observation_count observation_percent arithmetic_mean x1st_max_value
## 150448                 1                 100             9.2            9.2
## 150449                 1                 100            13.7           13.7
## 150450                 1                 100             8.1            8.1
## 150451                 1                 100             2.4            2.4
## 150452                 1                 100             3.3            3.3
## 150453                 1                 100             4.5            4.5
##        x1st_max_hour aqi method_code
## 150448             0  38         116
## 150449             0  54         116
## 150450             0  34         116
## 150451             0  10         116
## 150452             0  14         116
## 150453             0  19         116
##                                               method_name local_site_name
## 150448 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
## 150449 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
## 150450 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
## 150451 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
## 150452 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
## 150453 BGI Model PQ200 PM2.5 Sampler w/WINS - GRAVIMETRIC            <NA>
##                                      address     state_name county_name
## 150448 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
## 150449 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
## 150450 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
## 150451 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
## 150452 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
## 150453 BETHLEHEM VILLAGE HOUSING MGMT OFFICE Virgin Islands    St Croix
##            city_name cbsa_name date_of_last_change
## 150448 Not in a city      <NA>          2015-04-01
## 150449 Not in a city      <NA>          2015-04-01
## 150450 Not in a city      <NA>          2015-04-01
## 150451 Not in a city      <NA>          2015-04-01
## 150452 Not in a city      <NA>          2015-04-01
## 150453 Not in a city      <NA>          2015-04-01
#date is read as a factor so let's fix
aqi$date_local<-as.Date(aqi$date_local)

Next, let’s summarize our data. Once you summarize the AQI, do a “domain knowledge” check. Do these values make sense based on what we learned about the AQI? If so, move on and understand what the time and spatial domains of the data are.

summary(aqi)
##    state_code     county_code        site_num      parameter_code 
##  Min.   : 1.00   Min.   :  1.00   Min.   :   1.0   Min.   :88101  
##  1st Qu.:13.00   1st Qu.: 25.00   1st Qu.:   7.0   1st Qu.:88101  
##  Median :26.00   Median : 61.00   Median :  23.0   Median :88101  
##  Mean   :26.99   Mean   : 81.63   Mean   : 669.7   Mean   :88101  
##  3rd Qu.:40.00   3rd Qu.:101.00   3rd Qu.:1002.0   3rd Qu.:88101  
##  Max.   :78.00   Max.   :810.00   Max.   :9997.0   Max.   :88101  
##                                                                   
##       poc           latitude       longitude         datum       
##  Min.   :1.000   Min.   :17.71   Min.   :-158.09   NAD83: 26626  
##  1st Qu.:1.000   1st Qu.:34.30   1st Qu.:-103.48   WGS84:123827  
##  Median :1.000   Median :39.23   Median : -86.25                 
##  Mean   :1.156   Mean   :37.74   Mean   : -92.68                 
##  3rd Qu.:1.000   3rd Qu.:41.55   3rd Qu.: -80.09                 
##  Max.   :7.000   Max.   :64.85   Max.   : -64.78                 
##                                                                  
##                   parameter_name        sample_duration  
##  PM2.5 - Local Conditions:150453   1 HOUR       :  4987  
##                                    24-HR BLK AVG:  3998  
##                                    24 HOUR      :141468  
##                                                          
##                                                          
##                                                          
##                                                          
##          pollutant_standard   date_local              year     
##  PM25 24-hour 2012:145466   Min.   :2008-01-01   Min.   :2008  
##  NA's             :  4987   1st Qu.:2008-04-06   1st Qu.:2008  
##                             Median :2008-07-08   Median :2008  
##                             Mean   :2008-07-05   Mean   :2008  
##                             3rd Qu.:2008-10-06   3rd Qu.:2008  
##                             Max.   :2008-12-31   Max.   :2008  
##                                                                
##      month             day                           units_of_measure 
##  Min.   : 1.000   Min.   : 1.00   Micrograms/cubic meter (LC):150453  
##  1st Qu.: 4.000   1st Qu.: 8.00                                       
##  Median : 7.000   Median :16.00                                       
##  Mean   : 6.657   Mean   :15.89                                       
##  3rd Qu.:10.000   3rd Qu.:24.00                                       
##  Max.   :12.000   Max.   :31.00                                       
##                                                                       
##     event_type     observation_count observation_percent arithmetic_mean   
##  Excluded:  1093   Min.   : 1.00     Min.   :  4.0       Min.   : -0.9792  
##  Included:  4768   1st Qu.: 1.00     1st Qu.:100.0       1st Qu.:  6.2000  
##  None    :144592   Median : 1.00     Median :100.0       Median :  9.4000  
##                    Mean   : 1.72     Mean   :101.1       Mean   : 11.0152  
##                    3rd Qu.: 1.00     3rd Qu.:100.0       3rd Qu.: 14.1000  
##                    Max.   :24.00     Max.   :200.0       Max.   :200.2000  
##                                                                            
##  x1st_max_value   x1st_max_hour          aqi          method_code   
##  Min.   : -0.90   Min.   : 0.0000   Min.   :  0.00   Min.   :116.0  
##  1st Qu.:  6.30   1st Qu.: 0.0000   1st Qu.: 26.00   1st Qu.:118.0  
##  Median :  9.60   Median : 0.0000   Median : 40.00   Median :118.0  
##  Mean   : 11.54   Mean   : 0.3885   Mean   : 41.78   Mean   :126.9  
##  3rd Qu.: 14.40   3rd Qu.: 0.0000   3rd Qu.: 55.00   3rd Qu.:145.0  
##  Max.   :614.00   Max.   :23.0000   Max.   :250.00   Max.   :184.0  
##                                     NA's   :4987     NA's   :3998   
##                                                               method_name   
##  R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC             :68256  
##  R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric:31049  
##  Andersen RAAS2.5-300 PM2.5 SEQ w/WINS - GRAVIMETRIC                :23038  
##  R & P Model 2000 PM2.5 Sampler w/WINS - GRAVIMETRIC                : 6653  
##  Met One BAM-1020 Mass Monitor w/VSCC - Beta Attenuation            : 4869  
##  -                                                                  : 3998  
##  (Other)                                                            :12590  
##                       local_site_name  
##  Pahala                       :  1494  
##  Kona                         :  1476  
##  Oldtown                      :   842  
##  NORTH END SITE CENTRAL ARTERY:   719  
##  Mountain View 17             :   706  
##  (Other)                      :136433  
##  NA's                         :  8783  
##                                      address                state_name    
##  96-3150 PIKAKE ST                       :  1494   California    : 11973  
##  81-1043 KONAWAENA SCHOOL RD             :  1476   Pennsylvania  :  8399  
##  Oldtown Fire Station, 1100 Hillen Street:   842   Florida       :  7148  
##  174 NORTH ST                            :   719   Indiana       :  6578  
##  17-860 Volcano Road                     :   706   Ohio          :  6391  
##  500 SOUTH BROAD STREET-PARKING LOT (CHS):   703   North Carolina:  5782  
##  (Other)                                 :144513   (Other)       :104182  
##        county_name             city_name     
##  Hawaii      :  4346   Not in a city: 14491  
##  Jefferson   :  3961   Philadelphia :  2490  
##  Philadelphia:  2490   New York     :  1808  
##  Los Angeles :  2003   Pahala       :  1494  
##  Cook        :  1626   Kealakekua   :  1476  
##  Hamilton    :  1620   Baltimore    :  1427  
##  (Other)     :134407   (Other)      :127267  
##                                        cbsa_name      date_of_last_change
##  Philadelphia-Camden-Wilmington, PA-NJ-DE-MD:  4754   2015-04-01:115135  
##  New York-Newark-Jersey City, NY-NJ-PA      :  4400   2015-08-11:  3814  
##  Hilo, HI                                   :  4346   2018-05-21:  1653  
##  Chicago-Naperville-Elgin, IL-IN-WI         :  3530   2017-04-25:   972  
##  St. Louis, MO-IL                           :  2897   2016-01-07:   676  
##  (Other)                                    :123894   2018-10-01:   670  
##  NA's                                       :  6632   (Other)   : 27533
summary(aqi$aqi) # this is the actual AQI variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   40.00   41.78   55.00  250.00    4987
range(aqi$date_local)# Makes sense, spans one year
## [1] "2008-01-01" "2008-12-31"
unique(aqi$state_name) # There are 53: states + Puerto Rico + Virgin Islands + DC
##  [1] Alabama              Alaska               Arizona             
##  [4] Arkansas             California           Colorado            
##  [7] Connecticut          Delaware             District Of Columbia
## [10] Florida              Georgia              Hawaii              
## [13] Idaho                Illinois             Indiana             
## [16] Iowa                 Kansas               Kentucky            
## [19] Louisiana            Maine                Maryland            
## [22] Massachusetts        Michigan             Minnesota           
## [25] Mississippi          Missouri             Montana             
## [28] Nebraska             Nevada               New Hampshire       
## [31] New Jersey           New Mexico           New York            
## [34] North Carolina       North Dakota         Ohio                
## [37] Oklahoma             Oregon               Pennsylvania        
## [40] Rhode Island         South Carolina       South Dakota        
## [43] Tennessee            Texas                Utah                
## [46] Vermont              Virginia             Washington          
## [49] West Virginia        Wisconsin            Wyoming             
## [52] Puerto Rico          Virgin Islands      
## 53 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming
unique(aqi$parameter_name) #only includes the PM 2.5 AQI
## [1] PM2.5 - Local Conditions
## Levels: PM2.5 - Local Conditions

Next, let’s plot our AQI data to look at it’s variation over space in this time span. Which state seems to have the best AQI at a glance? Which seem to have the worst?

# Make a plot - may take a while! It's also very large so you may want to click "zoom" to see it better.
ggplot(data = aqi) +
  geom_point(mapping = aes(x = date_local, y = aqi), size = 0.5) +
  theme_few() +
  facet_wrap(~state_name)
## Warning: Removed 4987 rows containing missing values (geom_point).

Next, we’re interested in the NC data only so let’s subset our data. At the same time, we’ll also add a new column that is binary and tells us if the AQI is under 100 (“OK) or > 100 (”Not OK")

nc_aqi <-
  aqi %>%
  dplyr::select(date_local,
                year,
                month,
                day,
                aqi,
                state_name,
                county_name,
                city_name) %>% # Step 1
  filter(state_name == "North Carolina") %>% # Step 2
  mutate(aqi_category = ifelse(aqi < 100, "OK", "Not OK"))
# Because we've changed the dimensions of our dataset (i.e. we filtered the number of rows), we want to go back through the first few steps of the EDA to ensure that our new dataframe, nc_aqi, includes only the information we are expecting.

# Check the packaging ----
str(nc_aqi)
## 'data.frame':    5782 obs. of  9 variables:
##  $ date_local  : Date, format: "2008-01-01" "2008-01-04" ...
##  $ year        : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ month       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day         : int  1 4 7 10 13 16 19 22 25 28 ...
##  $ aqi         : int  21 65 47 29 60 43 57 57 43 52 ...
##  $ state_name  : Factor w/ 53 levels "Alabama","Alaska",..: 34 34 34 34 34 34 34 34 34 34 ...
##  $ county_name : Factor w/ 500 levels "Ada","Adair",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ city_name   : Factor w/ 651 levels "Aberdeen","Adjuntas",..: 74 74 74 74 74 74 74 74 74 74 ...
##  $ aqi_category: chr  "OK" "OK" "OK" "OK" ...
# Look at the top and bottom of your data ----
head(nc_aqi)
##   date_local year month day aqi     state_name county_name  city_name
## 1 2008-01-01 2008     1   1  21 North Carolina    Alamance Burlington
## 2 2008-01-04 2008     1   4  65 North Carolina    Alamance Burlington
## 3 2008-01-07 2008     1   7  47 North Carolina    Alamance Burlington
## 4 2008-01-10 2008     1  10  29 North Carolina    Alamance Burlington
## 5 2008-01-13 2008     1  13  60 North Carolina    Alamance Burlington
## 6 2008-01-16 2008     1  16  43 North Carolina    Alamance Burlington
##   aqi_category
## 1           OK
## 2           OK
## 3           OK
## 4           OK
## 5           OK
## 6           OK
#summarize your data
summary(nc_aqi)
##    date_local              year          month             day       
##  Min.   :2008-01-01   Min.   :2008   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:2008-04-03   1st Qu.:2008   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :2008-07-02   Median :2008   Median : 7.000   Median :16.00  
##  Mean   :2008-07-01   Mean   :2008   Mean   : 6.512   Mean   :15.88  
##  3rd Qu.:2008-09-30   3rd Qu.:2008   3rd Qu.: 9.000   3rd Qu.:24.00  
##  Max.   :2008-12-31   Max.   :2008   Max.   :12.000   Max.   :31.00  
##                                                                      
##       aqi                  state_name        county_name           city_name   
##  Min.   :  0.00   North Carolina:5782   Mecklenburg: 886   Not in a city: 998  
##  1st Qu.: 31.00   Alabama       :   0   Wake       : 583   Charlotte    : 532  
##  Median : 45.00   Alaska        :   0   Guilford   : 520   Raleigh      : 468  
##  Mean   : 44.67   Arizona       :   0   Forsyth    : 438   Winston-Salem: 438  
##  3rd Qu.: 57.00   Arkansas      :   0   Catawba    : 371   Greensboro   : 399  
##  Max.   :178.00   California    :   0   Pitt       : 208   Hickory      : 371  
##                   (Other)       :   0   (Other)    :2776   (Other)      :2576  
##  aqi_category      
##  Length:5782       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# Check our OK and Not OK classifications (e.g. the new column). Save new data frame that includes only the "Not OK" values
not_ok_nc <- nc_aqi[nc_aqi$aqi_category != "OK", ]

Next, let’s check which counties in NC have data available as not all counties may be monitored, then plot the data per county available.

# Counties monitored in NC
unique(nc_aqi$county_name)
##  [1] Alamance    Buncombe    Caswell     Catawba     Chatham     Cumberland 
##  [7] Davidson    Duplin      Durham      Edgecombe   Forsyth     Gaston     
## [13] Guilford    Haywood     Jackson     Lenoir      McDowell    Martin     
## [19] Mecklenburg Mitchell    Montgomery  New Hanover Orange      Pitt       
## [25] Robeson     Rowan       Swain       Wake        Watauga     Wayne      
## 500 Levels: Ada Adair Adams Adjuntas Alachua Alamance Alameda ... Yuma
length(unique(nc_aqi$county_name)) # only 30 available of the 100 NC counties
## [1] 30
# Make a plot of the AQI, colored by our new category
ggplot(data = nc_aqi) +
  geom_point(mapping = aes(x = date_local, y = aqi, color = aqi_category), alpha = 0.5) +
  theme_minimal()

Next, let’s subset the 30 counties we have data available into ones that were identified as “exposed” or “reference” by Rappold et al. (2011). Watch your spelling!

exposed_counties <- c("Dare", "Tyrrell", "Hyde", "Beaufort", "Pamlico", "Pitt", "Craven", "Jones", "Lenoir", "Greene", "Wilson", "Wayne", "Johnston", "Sampson", "Duplin", "Chowan", "Pasquotank", "Perquimans")

reference_counties <- c("Camden", "Currituck", "Washington", "Martin", "Bertie", "Hertford", "Northhampton", "Halifax", "Edgecombe", "Nash", "Franklin", "Vance", "Warren", "Harnett", "Cumberland", "Bladen", "Robeson", "Columbus", "Brunswick", "New Hanover", "Pender", "Onslow", "Carteret")

Now, let’s make some plots of all counties, then only exposed, or only reference counties for comparison. We’re going to separate the AQI at 100 and 150 to help us view AQI more easily. We’ll also add vertical lines for the dates of the bog fires (June 10 -12, 2008) to help us view the event of concern by filtering for the month of June and putting vertical lines at day 10 and 12.

# All counties
nc_aqi %>%
  filter(month == 6) %>%
  ggplot() +
  geom_line(mapping = aes(x = day, y = aqi)) +
  geom_point(mapping = aes(x = day, y = aqi)) +
  geom_hline(yintercept = c(100, 150), color = "red") +
  geom_vline(xintercept = c(10, 12), color = "blue") +
  facet_wrap(~ county_name) +
  theme_bw()

# Exposed counties
nc_aqi %>%
  filter(county_name %in% exposed_counties) %>%
  filter(month == 6) %>%
  ggplot() +
  geom_line(mapping = aes(x = day, y = aqi)) +
  geom_point(mapping = aes(x = day, y = aqi)) +
  geom_hline(yintercept = c(100, 150), color = "red") +
  geom_vline(xintercept = c(10, 12), color = "blue") +
  facet_wrap(~ county_name) +
  theme_bw()

# Reference counties
nc_aqi %>%
  filter(county_name %in% reference_counties) %>%
  filter(month == 6) %>%
  ggplot() +
  geom_line(mapping = aes(x = day, y = aqi)) +
  geom_point(mapping = aes(x = day, y = aqi)) +
  geom_hline(yintercept = c(100, 150), color = "red") +
  geom_vline(xintercept = c(10, 12), color = "blue") +
  facet_wrap(~ county_name) +
  theme_bw()

Next, let’s also summarize our AQI for exposed counties.

# Create a data summary
expsum<- nc_aqi %>%
  filter(county_name %in% exposed_counties) %>%
  group_by(county_name, month) %>%
  summarize(mean = mean(aqi), max = max(aqi), min = min(aqi)) 
## `summarise()` has grouped output by 'county_name'. You can override using the `.groups` argument.

Assignment

Write ~5 sentences to answer the following questions:

  1. Do the AQI data, based on ground measurements, correspond to the “exposed” counties determined from satellite imagery in the Rappold et al. (2011) study?
  2. In this case, what were the advantages and disadvantages of using data from satellite images and ground instruments (i.e. AQI)?
  3. How could AQI data be used to complement exposure estimates like those collected via satellite by Rappold et al. (2011) study?

References

Nelson, N., 2020. Problem-Centered Data Science Education in the Agricultural and Biological Engineering Classroom: Analyzing Air Quality Index Data in R, in: Case Studies and Modules for Data Science Instruction. Presented at the ASABE, p. 6. https://elibrary.asabe.org/textbook.asp?confid=sci2021

Rappold, A.G., Stone, S.L., Cascio, W.E., Neas, L.M., Kilaru, V.J., Carraway, M.S., Szykman, J.J., Ising, A., Cleve, W.E., Meredith, J.T., Vaughan-Batten, H., Deyneka, L., Devlin, R.B., 2011. Peat bog wildfire smoke exposure in rural North Carolina is associated with cardiopulmonary emergency department visits assessed through syndromic surveillance. Environ Health Perspect 119, 1415–1420. https://doi.org/10.1289/ehp.1003206